Avoiding inferior clusterings with misspecified Gaussian mixture models

Clustering is a fundamental tool for exploratory data analysis, and is ubiquitous across scientific disciplines. Gaussian Mixture Model (GMM) is a popular probabilistic and interpretable model for clustering. In many practical settings, the true data distribution, which is unknown, may be non-Gaussian and may be contaminated by noise or outliers. In such cases, clustering may still be done with a misspecified GMM. However, this may lead to incorrect classification of the underlying subpopulations. In this paper, we identify and characterize the problem of inferior clustering solutions. Similar to well-known spurious solutions, these inferior solutions have high likelihood and poor cluster interpretation; however, they differ from spurious solutions in other characteristics, such as asymmetry in the fitted components. We theoretically analyze this asymmetry and its relation to misspecification. We propose a new penalty term that is designed to avoid both inferior and spurious solutions. Using this penalty term, we develop a new model selection criterion and a new GMM-based clustering algorithm, SIA. We empirically demonstrate that, in cases of misspecification, SIA avoids inferior solutions and outperforms previous GMM-based clustering methods.


Background
Let f (x; θ) be the density of a K-component mixture model.Let f k denote the kth component density with parameters θ k and weight π k .The density of the mixture model is given by f (x; θ) = K k=1 π k f k (x; θ k ), where K k=1 π k = 1 and π k ≥ 0 for k = 1, . . ., K and θ denotes the complete set of parameters.In a GMM, each indi- vidual component f k is modeled using a multivariate Gaussian distribution N (µ k , � k ) where µ k and k are its mean and covariance respectively.Appendix A lists all the symbols used herein.
Given n independent and identically distributed (iid) instances of p-dimensional data, [x ij ] n×p where index i is used for observation, and j is used for dimension, Maximum Likelihood Estimation (MLE) aims to find parameter estimates θ from the overall parameter space of f (θ) such that probability of observing the data samples x 1 , . . ., x n is maximized, i.e., θ = arg max θ ∈� L(θ ) , where, L(θ ) = 1 n i log f (x i ; θ) is the empirical expected loglikelihood.
Following 36 , if the observed data are n iid samples from a probability distribution P(η * ) (where η * denotes the true set of parameters) and the fitted model has the same functional form P(.), then the model is said to be correctly specified.Otherwise, the model is said to be misspecified.Note that when the number of dimensions are greater than the number of datapoints ( p > n ), there are not enough datapoints to determine if the fitted model and data-generating distribution are parameterized by the same model, even along a single dimension and so, the notion of misspecification is moot.Appendices B and C give an overview of, respectively, spurious solutions and likelihood-based model selection criteria (such as Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)).
Recent reviews on Automatic Differentiation (AD) can be found in 37,38 ; a brief review is in Appendix D. To obtain MLE of GMMs, EM elegantly solves 3 problems: (1) Intractability of evaluating the closed-forms of the derivatives, (2) Ensuring positive definiteness of the covariance estimates ˆ k , and (3) Ensuring the con- straint on the component weights ( k πk = 1 ).Problem 1 is inherently solved by the use of AD.Problems 2 can be addressed through simple reparametrizations in the form of matrix decomposition 34,39 .Problems 3 can be

Simulation study
We study the clustering solutions obtained by fitting misspecified GMM on Pinwheel data, also known as warped GMM, with 3 components and 2 dimensions 40 .Pinwheel data is generated by sampling from Gaussian distributions, and then stretching and rotating the data.The centers are equidistant around the unit circle.The variance is controlled by parameters r, the radial standard deviation and t, the tangential standard deviation.The warping is controlled by a third parameter, s, the rate parameter.Thus, the extent of misspecification (i.e., the deviation of the data from the assumed Gaussian distributions in GMM) can be controlled using these parameters.An example is shown in Fig. 1.We generate 1800 Pinwheel datasets with different combinations of parameters.In addition, we also simulate 1800 3-component, 2-dimensional datasets from GMM with varying overlap of components (to analyze as a control case where there is no misspecification in fitting a GMM).For each dataset we obtain two clustering solutions, one each using EM and AD-GD.We run both the algorithms till convergence, using the same initialization and stopping criterion.We use ARI to evaluate performance where higher values indicate better clustering.Details are in Appendix E.
Our experiments on these 3600 datasets show that EM outperforms AD-GD in both cases-when there is misspecification and no missspecification.However, when there is misspecification, we find that for both EM and AD-GD, there are many inferior solutions that have low ARI and unequal fitted covariances, often with one fitted component having relatively large covariance, resulting in a high degree of overlap between components.We illustrate this using Pinwheel data generated with parameters r = 0.3 , t = 0.05 and s = 0.4 .Both EM and AD-GD are run with 100 different initializations.We group the solutions into 4 sets based on AIC as shown in Table 1, which also shows the average AIC and ARI obtained by AD-GD and the number of EM and AD-GD  www.nature.com/scientificreports/solutions in each set.Table 27 (in Appendix E) show the statistics for EM solutions which are very similar.We visualize the clustering for one solution from each of these sets in Fig. 1.We observe that both EM and AD-GD obtain very similar solutions in terms of AIC and ARI for this dataset.The best average ARI is that of set 2 (row 2 of Table 1).These solutions are obtained in less than 5% of the cases.The overall best ARI, of 0.941, is from a solution in set 4 that has a high AIC value of 813.4 as shown in Fig. 1d.The best AIC, of 771.9, is obtained by a solution from set 1 which has considerably lower ARI of 0.625 as shown in Fig. 1a.Thus, we see solutions with high likelihood and low ARI.We observe that there are many such inferior solutions in sets 1, 3 and 4 having a fitted component with large variance, also seen in the specific solutions in Fig. 1.
In cases of misspecification, inferior clusterings from both EM and AD-GD, are found to have fitted covariances that differ considerably in their orientations and sizes.We observe this in the solutions in Fig. 1 and through the summary statistics in Table 1.We find that such inferior solutions in misspecified models occur frequently with many different initializations, and typically when there is a component with large variance.This is different from the characterization of spurious solutions (see Appendix B) that are found to occur rarely, only with certain initializations and due to a component with small variance 1 .
We find similar inferior solutions with low ARI and low AIC in cases where Gaussian components are contaminated by a Student's-t distribution and random noise (details in Appendix F).Further, we have observed similar effects of misspecification in higher dimensions as well.Illustrations in datasets of up to 100 dimensions are in Appendix F.

Misspecification and asymmetric components
We now theoretically examine how asymmetry of fitted components varies with misspecification in the case of univariate mixtures, following the setting of 5,6,41 .
Let the true data generating distribution be G , with 0 < π ≤ 0.5 .Without loss of generality, we assume µ > 0 .When π = 0.5 , the true distribution is a symmetric 2-component GMM, with the component means equidistant on either side of the real line as shown in Fig. 2a.In this case, it has been shown that under some identifiability conditions, fitting a 2-component GMM (i.e., without any misspecification) on the data sampled from the true distribution using MLE leads to symmetric fitted components whose parameters converge to that of the true distribution 32,36 .As π is reduced from 0.5, an additional Gaussian component with mean µ (that coincides with one of the component means) and variance b 2 σ 2 is introduced.Fitting a 2-component GMM in the case when π < 0.5 leads to a misspecified model.We analyze the asymmetry of the fitted components for π < 0.5 with varying misspecification (b).
Let the fitted misspecified distribution be Note that this is a projection to the fitted model and is the best possible estimator 5 .Let erf be the Gauss error function 42 which is an odd function, whose values lies in (−1, 1) and is related to the CDF of a standard normal distribution as (x) . When misspecification is small, the means of fitted components μ1 , μ2 typically have opposite signs, as the true components flank the y-axis.When there is misspecification, we expect erf μ1 to have unequal and opposite signs.We define an asymmetry coefficient:

σ1
). www.nature.com/scientificreports/τ ( θ) measures a form of asymmetry in µ σ of the fitted components when there is misspecification.When there is no misspecification, since MLE estimates converge to the true parameters asymptotically, τ ( θ) also converges to zero.However, when there is misspecification, the variance of one of the components, say σ 1 , tends to be much larger than that of the other component (as seen in Table 1); in such cases, erf µ 1 √ 2σ 1 would be much closer to zero compared to the error function for the other component and τ ( θ) would converge to a non-zero value.We derive bounds on τ ( θ) as follows (proof in Appendix G): Theorem 1 Let the true data generating distribution be G . Then, Note that C 2 and C w depend only on the true parameters (π, µ, σ , b) and are constants with respect to the fitted model.Assuming that the true parameters are known, these bounds provide a certification on whether the fitted parameters θ indeed correspond to the maximum likelihood, thus helping to filter out fitted parameters corresponding to spurious solutions and undesirable local optima.
We empirically compare the upper bound and observed value of τ ( θ) when µ = 5.0, σ = 10, π = 0.35 by varying the values of b (the behaviour of the lower bound is qualitatively similar).For a given value of b ∈ {0.1, 0.2, . . ., 6} , we simulate 50 datasets and pick the maximum of τ ( θ) among these 50 output solutions.The upper bound and the observed maximum are plotted in Fig. 2b.
At around b = 1 (when there is no misspecification), the upper bound reaches its minimum.As b moves away from 1, the upper bound increases which illustrates that as misspecification increases, the asymmetry in the fitted components also increases.Additional plots for other values of (µ, σ , π) are in Appendix G.We observe that, if one of the components has a relatively large variance, then many datapoints may be wrongly assigned to this component leading to poor clustering.

A penalized clustering method A new penalty term
Our analysis in the previous section naturally leads to the goal of reducing the asymmetry in the fitted components to improve clustering with misspecified GMMs.To this end, we develop a new functional penalty term that (i) penalizes differences in orientations and sizes of the fitted components to avoid inferior solutions and (ii) bounds the penalized likelihood to avoid spurious solutions during ML estimation.
Our penalty term is based on the KL-divergence between component Gaussians.Let N k denote the multivariate Gaussian distribution N (µ k , � k ) .The KL-divergence KL(N 1 , N 2 ) is given below where each term can provide useful penalties: A Penalizes the difference in size of the covariance matrices of the two components.Even if the directions of principal axes of the two covariance matrices are different, this term will be zero if the component determinants are exactly the same, since log(1) = 0. B Penalizes the difference in orientations, i.e., if the directions of principal axes of the component covariance matrices are vastly different.When 1 and 2 are equal, this penalty term becomes zero.C Penalizes the assignment of a single cluster to faraway outlier points which have an extremely low likelihood of being observed.If some outlier points are assigned a single cluster, then the cluster center, µ 1 , would be different from the cluster center µ 2 of non-outlier data, and so, the Mahalonobis distance between the cluster centers, �µ 1 − µ 2 � , would be high, with higher penalization.KL-divergence is not symmetric about 1 and 2 .If there is an order of magnitude difference in the covariance matrices 1 and 2 , it can be detected through the values of KL(N 1 , N 2 ) and KL(N 2 , N 1 ) .The differ- ence in their values primarily stems from the difference between the terms , and tr{ −1 2 1 } and tr{ −1 1 2 } .The difference in these two KL divergence values pro- vides signal about the overlap or asymmetry of the two Gaussians.This notion is generalized to a K-component GMM, through the combinatorial KL divergences, KLF (forward) and KLB (backward), where Well separated equal-sized clusters typically have similar values of KLF and KLB (By well-separated we refer to the case where the means of the Gaussian distributions representing the two clusters are far apart from each other relative to their spreads (as dictated by their respective covariance matrices).In this scenario, the values of both KLF and KLB are both dominated by the term B and term C i.e. the trace term and the Mahalonobis distance term.The difference between KLF and KLB will be mainly due to the differences in terms B, C between the pairwise KL-Divergences.Now consider a scenario, where the spread of one of the cluster is much higher compared to the other one.In this not-well-separated case, the terms B,C will differ more significantly between the pairwise KL-Divergences.This effect will be later seen clearly in Fig. 3 where the KLF and KLB values are high when the clusters not well separated.).We denote both the values by KLDivs = {KLF, KLB}.We note that these two sums (KLF + KLB) together give the sum of Jeffrey's divergence between all components 43 .In the clustering outputs shown in Fig. 1, we see that in solution (c) from set 3, where clustering is poor, KLF= 258 and KLB= 494 (the difference is high), while in solution (d) from set 4, which has better clustering, KLF= 127 and KLB= 64 (with low difference).
Our proposed penalty term is a weighted sum of the KLF and KLB terms: −w 1 × KLF − w 2 × KLB , with negative weights (−w 1 , −w 2 ) .With positive weights GD will further shrink the smaller clusters.Negative weights lead to reduced overlap as well as clusters of similar volume.In our experiments, we found that choosing w 1 = w 2 works in almost all cases.Further, choosing w 1 = w 2 makes the penalized objective invariant to permutation in the component labels.In general, we can say that for all positive values of w 1 , w 2 , the weighted penalty term (made up of KLF, KLB) will be always positive as both KLF and KLB are summation of KL divergences.Alternatively we also take a closer look at the individual terms that make up KLF and KLB.Term C is the Mahalonobis term which will always be positive.Term A+B is known as the Burg Matrix Divergence or the LogDet Divergence, which is again guaranteed to be positive 44 .However, the same cannot be said for the individual terms A and B separately.E.g., if we were to consider the contribution of the log determinant terms towards the penalty in isolation, it is possible that the aggregate of the log determinant terms can be negative.As a minor side point, we also want to point out that when we set w 1 = w 2 , as we did in our experiments, the log determinant terms cancel out each other in the final expression and thus, only terms B and C would contribute to the penalty.
Optimization of likelihood with these penalty terms can be implemented effortlessly through AD-GD where gradients in closed forms are not required.However, the use of such complex penalties is difficult within EM.We cannot obtain closed forms of the covariance estimates.Closed forms for the mean update can be derived (Appendix H) but is laborious and each mean update depends on means for all other components, and hence cannot be parallelized in EM.

Sequential initialization algorithm (SIA)
SIA consists of two steps.In Step I of SIA, we use the loglikelihood L as the objective and run EM or AD-GD to fit a GMM.Typically, the output at the end of Step I will have unequal KL-divergences for misspecified models.The parameters at the end of Step I are used to initialize the algorithm in step II where we modify the objective function to: After the second optimization step the likelihood decreases but the KL-divergence values, KLF and KLB, come closer.The complete algorithm is presented in Algorithm 1.
In SIA, after Step I we know the likelihood and model parameters of the unpenalized fitted model.In Step II, we choose w 1 , w 2 in such a way that the overall likelihood of the model doesn't reduce drastically, to find solutions (with less dissimilar covariances) close to the solution after Step I.In our experiments, the values of w 1 , w 2 have been kept equal and chosen from {0, 0.25, 0.5, 1, 1.25} using the MPKL criterion (described below).

SIA objective is bounded
We consider maximum likelihood estimation of parameters θ of a two-component GMM, from n samples, each of p-dimensions, using SIA.We prove that our penalized log-likelihood M is bounded, and thus SIA alleviates the problem of degeneracy and spurious solutions in unconstrained GMMs.
Theorem 2 Let C denote a constant dependent only on p, n, w 1 , w 2 , c and independent of θ .Assume that the spectral norm of the generalized variances obtained using MLE, 1 , 2 , is bounded by c < ∞ (as a regularity condition) and, without loss of generality, assume | 1 | ≤ | 2 | and that the loglikelihood L → ∞ when the first component www.nature.com/scientificreports/collapses on a datapoint, i.e., | 1 | → 0 .For non-negative weights w 1 , w 2 , and any iteration t, SIA objective function M is always bounded: The proof, in Appendix G, also clarifies that the result generalizes to arbitrary number of components.The bound is given in terms of w 2 because we assume | 1 | ≤ | 2 | (the objective is not symmetric with respect to w 1 and w 2 ).Hence, w 1 does not appear in the theorem statement.Also note that, w 1 and w 2 are the only controllable hyper-parameters.By setting w 2 = 0 , we can see that the maximum of unpenalized likelihood is unbounded.If a collapses, the log-likelihood may increase but the trace term and the determinant in the penalization will explode.Hence, by having negative weights −w 1 , −w 2 we can ensure that M is bounded and control the behavior of the algorithm.Note that both | 1 |, | 2 | will not collapse to zero in a maximum-likelihood estimation approach because if both collapse the log-likelihood goes to negative infinity (which is easy to verify).A visualization of the likelihood surface, with and without penalty terms, is shown in Appendix I.The fact that there exists an upper bound to the penalized likelihood (even when the generalized variance of one of the components collapses to zero), irrespective of its tightness, is sufficient to show that SIA avoids degenerate and spurious solutions.

MPKL: a model selection criterion
We define the Maximum absolute Pairwise difference between KL divergence values (MPKL) for a K-component GMM as: MPKL is an indicator of how well the clusters are separated.It is invariant to permutation in cluster labels and can be used as a criterion for selecting the number of clusters.For a chosen range of number of clusters ( 2, . . ., L ), we compute MPKL for each value and choose K that minimizes MPKL: argmin K∈[2,...,L] MPKL .Note that MPKL criterion is independent of SIA and can be potentially useful for any GMM-based method, even in the absence of suspected misspecification.In our experiments (Evaluation on real datasets and Appendix L.2), we show that use of MPKL aids both GD and EM based methods.

Computational complexity
The computational complexity is dominated by evaluating KLF and KLB which involves O(K 2 ) inversion steps ( O(p 3 ) ).Therefore, the overall time complexity of both SIA and MPKL is O(K 2 p 3 ) .This is comparable to most GMM inference methods (e.g., EM) due to the O(p 3 ) matrix inversion step.Wall-clock times of EM, AD-GD and SIA implementations are compared in Appendix J.

Illustrative examples
Figure 3 shows the clustering solution (set 3) obtained on the Pinwheel data discussed in Inferior clustering solutions, after steps I and II of SIA.After step II, compared to the clustering after step I, the likelihood decreases, both the KLF and KLB values decrease, the ARI increases, and the clusters have less overlap.The same is observed for the other three sets of clustering solutions.Similar illustrations on datasets contaminated with Student's-t and random noise are in Appendix F.

Simulation studies
We simulate over 2000 datasets from mixtures with non-Gaussian components and with varying (i) cluster separation (ii) covariance structures and imbalance across component sizes and (iii-iv) number of components.In all these settings, we vary the dimensionality as well.Table 2 shows a summary of all the settings and the sections in which the experiments are discussed.Experiments to evaluate MPKL as a model selection criterion are in Appendix L.2.   www.nature.com/scientificreports/data points ( N 1 × p ) in cluster 1 constant at 100 × p , we vary the number of datapoints ( N 2 × p ) in cluster 2 as {20 × p, 50 × p, 100 × p} .We simulate 50 different datasets for each setting.We vary the dimensionality p as {2, 3, 5, 10, 20} .The results are given in Tables 8, 9 and 10.We observe that when the dimensionality is low, clustering performance is better at higher imbalance.Appendix K shows an illustration.At higher values of p, clustering performance is better at lower imbalance.Overall, we find that SIA performs on par or better than EM, AD-GD and MClust.
The datasets are diverse with respect to values of n, p and K and such that their underlying clusters do not appear to be Gaussian or well-separated (as seen through their scatter-plots).EM, AD-GD, MClust and SIA are Table 5. ARI (mean and SD) on varying and p = 5.Significant values are in bold.initialized using K-Means, with 10 different random initializations within K-Means, and the best result obtained is reported.
Table 11 shows the dataset statistics and the ARI obtained (using the provided ground truth) by each method.
In datasets IV and V, the number of observations is lesser than the dimensions ( n < p ), EM fails to run and AD-GD assigns all the points to a single cluster.

Case study
The Wine dataset contains the results of a chemical analysis of wines, yielding 13 features, from 3 different cultivars.We use this dataset, without the labels of wine types, and fit EM, AD-GD, MClust and SIA for K ∈ {2, 3, 4} .The log-likelihood (LL), MPKL, KLDivs, AIC and ARI values are shown Table 12.For each K we show the results for multiple runs with different initializations.
For K = 2 , among the 3 EM solutions, while clustering S1 has better LL, its MPKL is higher (worse) than those obtained by the other clustering outputs (S2 and S3).So, even among EM solutions, by trading off likelihood for an improvement in MPKL, we can choose better clustering solutions with higher accuracy.The LL of SIA is better than that of AD-GD (which was used in step 1 of SIA) indicating that its penalty term can aid in escaping local maxima.The ARI from SIA (S5) is comparable to the best results from EM.We observe similar trends for K = 3 .Among EM solutions (S6 and S7), the solution with better MPKL (S7) achieves a better ARI.Again, the LL of SIA (S9) is better than that of AD-GD (S8), which was used in step 1 of SIA.The best ARI is achieved by SIA which also has the lowest MPKL value.For K = 4 , we observe that AD-GD solution (S12) does not have high LL, nor does it have a low MPKL value.When SIA step-2 is initialized with the output of EM (S13), we observe a decrease in MPKL and an improvement in ARI over both EM and AD-GD.The confusion matrices of models with the best MPKL values for each K is in Table 13.Misspecification of the components is evident in the data scatterplot and S9 visually has the most well separated clusters (both in Appendix M).

Conclusion
In this paper, we illustrate and discuss the problem of inferior solutions that occur while clustering with misspecified GMMs.They differ in their characteristics from spurious solutions in terms of the asymmetry of component orientation and sizes, and frequency of occurrence.Our theoretical analysis highlights a new relation between such asymmetry of fitted components and misspecification.Further investigation of this connection would be interesting to explore in the future.
We propose a new penalty term based on the Kullback Leibler divergence between pairs of fitted components that, by design, avoids inferior solutions.We prove that the penalized likelihood is bounded and avoids degeneracy.Gradient computations for this penalized likelihood is difficult but Automatic Differentiation can be done effortlessly.We develop algorithms SIA for clustering and MPKL for model selection, and evaluate their performance, on clustering synthetic and real datasets with misspecified GMMs.The superiority of SIA for misspecification is primarily demonstrated through extensive experimentation.Our simulation study reveals that SIA works well in several cases of misspecification that we examine, particularly when the cluster separation is low.At high cluster separation, the effect of misspecification on clustering is less acute and other methods, like MClust, also perform well, and the performance of SIA is comparable.Although the penalty term in SIA is designed to penalize component asymmetry, SIA performs well even when there are significant differences with respect to orientations and sizes of the true components in our simulation study.This is because SIA does  www.nature.com/scientificreports/not use hard constraints on the orientations and sizes and the functional penalization approach enables a datadependent discovery of the underlying clusters.Nevertheless, it is possible in some cases for SIA to fit symmetric components when the underlying clusters have covariances with different orientations and sizes.The use of our proposed MPKL criterion in conjunction with other likelihood-based criteria like AIC or BIC, as discussed in our case study, can guide the user to find good clustering solutions in such cases.Statistical guarantees with this penalized likelihood approach and extensions to high-dimensional settings require further work and would be of theoretical and practical interest.

Data and code availability
Python implementation of our algorithms and experiments available at https:// bitbu cket.org/ cdal/ sia/.

Figure 1 .
Figure 1. 4 clustering solutions obtained with AD-GD using different initializations on Pinwheel data; Sets refer to groups inTable 1; in parentheses: (ARI, AIC), best values in bold.

Figure 2 .
Figure 2. (a) True distribution is a GMM and a contamination .We fit a 2-component (misspecified) GMM .(b) Empirical comparison of τ ( θ) and its upper bound with varying b.

Table 1 .
Summary statistics of clustering solutions over 100 random initializations on the Pinwheel dataset (see Fig.1), grouped into 4 sets based on AIC ranges.Mean (Standard Deviation) of parameter estimates from AD-GD, EM estimates (in Appendix E) are similar.

Table 3 .
ARI (mean and SD) on varying and p = 2. Significant values are in bold.

Table 6 .
ARI (mean and SD) on varying and p = 10.Significant values are in bold.

Table 7 .
ARI (mean and SD) on varying and p = 20.Significant values are in bold.

Table 8 .
ARI (mean and SD) for different p and N 2 = 20.Significant values are in bold.

Table 9 .
ARI (mean and SD) for different p and N 2 = 50.Significant values are in bold.

Table 10 .
ARI (mean and SD) for different p and N 2 = 100.Significant values are in bold.

Table 11 .
Clustering performance (ARI and RI) on ten real datasets.Significant values are in bold.

Table 12 .
Clustering results for wine dataset.Significant values are in bold.